#load bin.votes (the voting data)
load("32ndDail_FifthSession_BinaryVotes.Rdata")
#load the data of TDs
parties<-read.csv("TDs_names_parties.csv",row.names = 1)
#make it 0 and 1
votes<-bin.votes-1
In a hierarchical clustering procedure, the first step after loading the data is to choose a distance measure.
#load the libraries
library(ggplot2) # ggplot
library(tidyverse) # data manipulation
## -- Attaching packages --------------------------------------------- tidyverse 1.3.0 --
## v tibble 3.0.3 v dplyr 1.0.0
## v tidyr 1.1.0 v stringr 1.4.0
## v readr 1.3.1 v forcats 0.5.0
## v purrr 0.3.4
## -- Conflicts ------------------------------------------------ tidyverse_conflicts() --
## x dplyr::filter() masks stats::filter()
## x dplyr::lag() masks stats::lag()
library(cluster) # clustering algorithms
library(factoextra) # clustering visualization
## Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
#use binary method to treat binary data
distance<-dist(votes[1:6],method="binary")
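A quick sketch of what method = "binary" computes, using toy vectors rather than the voting data: the asymmetric binary (Jaccard) distance is the number of mismatching positions divided by the number of positions where at least one vector has a 1.

```r
# Toy illustration (not the voting data) of dist(method = "binary")
a <- c(1, 1, 0, 0, 1)
b <- c(1, 0, 0, 1, 1)
manual <- sum(a != b) / sum(a == 1 | b == 1)       # 2 mismatches / 4 active positions
d_bin  <- as.numeric(dist(rbind(a, b), method = "binary"))
d_bin == manual                                    # TRUE (both are 0.5)
```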
Next, I decided not to use single linkage for this clustering because of its chaining effect. Of the two options left (average and complete), I chose average linkage, which measures the distance between two clusters as the average of the distances over all pairs of points, one from each cluster.
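The average-linkage idea can be sketched with toy one-dimensional points (not the TD votes): the distance between two clusters is simply the mean of all pairwise distances between their members.

```r
# Toy sketch of average linkage (hypothetical points, not the voting data)
pts   <- c(0, 1, 10, 12)        # cluster A = {0, 1}, cluster B = {10, 12}
dm    <- as.matrix(dist(pts))
d_avg <- mean(dm[1:2, 3:4])     # (10 + 12 + 9 + 11) / 4 = 10.5
```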
I always start with a small number of clusters. After running the hierarchical clustering function, I plot the dendrogram, inspect it for a plausible number of clusters and cut it; here I found 2 clusters. Then I made another cut with 3 clusters to compare.
#run hclust to perform hierarchical clustering
average <- hclust(distance, method = 'average')
#cut the average dendrogram into 2 and 3 clusters
cut_average2 <- cutree(average, k = 2)
cut_average3 <- cutree(average, k = 3)
#plot the dendrogram with cuts of 2 and 3 clusters
avg_dendo <- as.dendrogram(average)
par(mfrow=c(1,2))
plot(avg_dendo,main="H-2 Clustering")
rect.hclust(average, k = 2)
plot(avg_dendo,main="H-3 Clustering")
rect.hclust(average, k = 3)
#show the number of observations for 2 clusters
a2<-table(cut_average2)
a2
## cut_average2
## 1 2
## 146 10
With 2 clusters, the larger cluster has 146 members.
#show the number of observations for 3 clusters
a3<-table(cut_average3)
a3
## cut_average3
## 1 2 3
## 100 46 10
Then, with 3 clusters, the largest cluster shrinks to 100 members. Now, let's find the optimal number of clusters with the elbow and silhouette methods.
#show the optimal number of clusters
fviz_nbclust(votes, FUN = hcut, method = "wss")
fviz_nbclust(votes, FUN = hcut, method = "silhouette")
Based on both graphs, the solution with 2 clusters is better than the rest: in each graph there is a pronounced change between 1 and 2 clusters, after which the curve levels off.
Latent class analysis (LCA) can be thought of as a model-based approach to clustering when the recorded data are binary in nature. Polytomous latent class analysis (poLCA) is a clustering method that can be used when the variables are categorical.
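The model behind LCA can be sketched with made-up numbers (hypothetical, not the fitted model below): the probability of a response pattern y is a mixture over classes, P(y) = sum_g pi_g * prod_j p_gj^y_j * (1 - p_gj)^(1 - y_j).

```r
# Toy sketch of the latent class mixture (hypothetical values)
pi_g <- c(0.4, 0.6)              # class shares
p    <- rbind(c(0.9, 0.2),       # P(vote = 1) for each item, class 1
              c(0.1, 0.8))       # P(vote = 1) for each item, class 2
y    <- c(1, 0)                  # one observed response pattern
lik  <- sum(pi_g * apply(p, 1, function(pg) prod(pg^y * (1 - pg)^(1 - y))))
lik                              # 0.4*0.72 + 0.6*0.02 = 0.3
```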
library(poLCA)
## Loading required package: scatterplot3d
## Loading required package: MASS
##
## Attaching package: 'MASS'
## The following object is masked from 'package:dplyr':
##
## select
library(dplyr)
# models with different number of groups without covariates
f <-cbind(Environment,RentFreeze,SocialWelfare,GamingAndLotteries,HousingMinister,FirstTimeBuyers) ~ 1
# set seed to keep the same result
set.seed(123)
lc1<-poLCA(f, data=bin.votes, nclass=1)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $Environment
## Pr(1) Pr(2)
## class 1: 0.7564 0.2436
##
## $RentFreeze
## Pr(1) Pr(2)
## class 1: 0.4679 0.5321
##
## $SocialWelfare
## Pr(1) Pr(2)
## class 1: 0.4744 0.5256
##
## $GamingAndLotteries
## Pr(1) Pr(2)
## class 1: 0.5577 0.4423
##
## $HousingMinister
## Pr(1) Pr(2)
## class 1: 0.6603 0.3397
##
## $FirstTimeBuyers
## Pr(1) Pr(2)
## class 1: 0.5385 0.4615
##
## Estimated class population shares
## 1
##
## Predicted class memberships (by modal posterior prob.)
## 1
##
## =========================================================
## Fit for 1 latent classes:
## =========================================================
## number of observations: 156
## number of estimated parameters: 6
## residual degrees of freedom: 57
## maximum log-likelihood: -617.0786
##
## AIC(1): 1246.157
## BIC(1): 1264.456
## G^2(1): 369.0819 (Likelihood ratio/deviance statistic)
## X^2(1): 867.4229 (Chi-square goodness of fit)
##
lc2<-poLCA(f, data=bin.votes, nclass=2)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $Environment
## Pr(1) Pr(2)
## class 1: 0.345 0.655
## class 2: 1.000 0.000
##
## $RentFreeze
## Pr(1) Pr(2)
## class 1: 0.9739 0.0261
## class 2: 0.1684 0.8316
##
## $SocialWelfare
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.1631 0.8369
##
## $GamingAndLotteries
## Pr(1) Pr(2)
## class 1: 0.3571 0.6429
## class 2: 0.6765 0.3235
##
## $HousingMinister
## Pr(1) Pr(2)
## class 1: 0.9670 0.0330
## class 2: 0.4786 0.5214
##
## $FirstTimeBuyers
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.2652 0.7348
##
## Estimated class population shares
## 0.3719 0.6281
##
## Predicted class memberships (by modal posterior prob.)
## 0.359 0.641
##
## =========================================================
## Fit for 2 latent classes:
## =========================================================
## number of observations: 156
## number of estimated parameters: 13
## residual degrees of freedom: 50
## maximum log-likelihood: -462.0166
##
## AIC(2): 950.0332
## BIC(2): 989.6813
## G^2(2): 58.95794 (Likelihood ratio/deviance statistic)
## X^2(2): 67.89259 (Chi-square goodness of fit)
##
lc3<-poLCA(f, data=bin.votes, nclass=3)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $Environment
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.2227 0.7773
## class 3: 1.0000 0.0000
##
## $RentFreeze
## Pr(1) Pr(2)
## class 1: 0.9838 0.0162
## class 2: 0.9709 0.0291
## class 3: 0.1397 0.8603
##
## $SocialWelfare
## Pr(1) Pr(2)
## class 1: 0.9943 0.0057
## class 2: 1.0000 0.0000
## class 3: 0.1338 0.8662
##
## $GamingAndLotteries
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.2375 0.7625
## class 3: 0.6647 0.3353
##
## $HousingMinister
## Pr(1) Pr(2)
## class 1: 0.6481 0.3519
## class 2: 1.0000 0.0000
## class 3: 0.4863 0.5137
##
## $FirstTimeBuyers
## Pr(1) Pr(2)
## class 1: 0.9112 0.0888
## class 2: 1.0000 0.0000
## class 3: 0.2506 0.7494
##
## Estimated class population shares
## 0.0803 0.3134 0.6063
##
## Predicted class memberships (by modal posterior prob.)
## 0.0962 0.2949 0.609
##
## =========================================================
## Fit for 3 latent classes:
## =========================================================
## number of observations: 156
## number of estimated parameters: 20
## residual degrees of freedom: 43
## maximum log-likelihood: -451.9819
##
## AIC(3): 943.9637
## BIC(3): 1004.961
## G^2(3): 38.88854 (Likelihood ratio/deviance statistic)
## X^2(3): 35.25544 (Chi-square goodness of fit)
##
lc4<-poLCA(f, data=bin.votes, nclass=4)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $Environment
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.2234 0.7766
## class 3: 1.0000 0.0000
## class 4: 1.0000 0.0000
##
## $RentFreeze
## Pr(1) Pr(2)
## class 1: 0.1638 0.8362
## class 2: 0.9691 0.0309
## class 3: 0.9760 0.0240
## class 4: 0.1257 0.8743
##
## $SocialWelfare
## Pr(1) Pr(2)
## class 1: 0.0520 0.9480
## class 2: 1.0000 0.0000
## class 3: 0.9512 0.0488
## class 4: 0.1591 0.8409
##
## $GamingAndLotteries
## Pr(1) Pr(2)
## class 1: 0.0000 1.0000
## class 2: 0.2365 0.7635
## class 3: 1.0000 0.0000
## class 4: 0.8764 0.1236
##
## $HousingMinister
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 1.0000 0.0000
## class 3: 0.6677 0.3323
## class 4: 0.3151 0.6849
##
## $FirstTimeBuyers
## Pr(1) Pr(2)
## class 1: 0.2594 0.7406
## class 2: 1.0000 0.0000
## class 3: 0.9123 0.0877
## class 4: 0.2405 0.7595
##
## Estimated class population shares
## 0.1466 0.3137 0.0848 0.4549
##
## Predicted class memberships (by modal posterior prob.)
## 0.1667 0.2949 0.0962 0.4423
##
## =========================================================
## Fit for 4 latent classes:
## =========================================================
## number of observations: 156
## number of estimated parameters: 27
## residual degrees of freedom: 36
## maximum log-likelihood: -440.3629
##
## AIC(4): 934.7259
## BIC(4): 1017.072
## G^2(4): 15.65066 (Likelihood ratio/deviance statistic)
## X^2(4): 14.75873 (Chi-square goodness of fit)
##
lc5<-poLCA(f, data=bin.votes, nclass=5)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $Environment
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.0227 0.9773
## class 3: 1.0000 0.0000
## class 4: 1.0000 0.0000
## class 5: 0.2141 0.7859
##
## $RentFreeze
## Pr(1) Pr(2)
## class 1: 0.1571 0.8429
## class 2: 0.8614 0.1386
## class 3: 0.1153 0.8847
## class 4: 0.9525 0.0475
## class 5: 1.0000 0.0000
##
## $SocialWelfare
## Pr(1) Pr(2)
## class 1: 0.0703 0.9297
## class 2: 1.0000 0.0000
## class 3: 0.1731 0.8269
## class 4: 0.9559 0.0441
## class 5: 1.0000 0.0000
##
## $GamingAndLotteries
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 1.0000 0.0000
## class 3: 0.4234 0.5766
## class 4: 1.0000 0.0000
## class 5: 0.0584 0.9416
##
## $HousingMinister
## Pr(1) Pr(2)
## class 1: 0.0000 1.0000
## class 2: 1.0000 0.0000
## class 3: 0.8137 0.1863
## class 4: 0.7032 0.2968
## class 5: 1.0000 0.0000
##
## $FirstTimeBuyers
## Pr(1) Pr(2)
## class 1: 0.3157 0.6843
## class 2: 1.0000 0.0000
## class 3: 0.1988 0.8012
## class 4: 0.9092 0.0908
## class 5: 1.0000 0.0000
##
## Estimated class population shares
## 0.2431 0.0472 0.3569 0.1015 0.2512
##
## Predicted class memberships (by modal posterior prob.)
## 0.2692 0.0577 0.3269 0.109 0.2372
##
## =========================================================
## Fit for 5 latent classes:
## =========================================================
## number of observations: 156
## number of estimated parameters: 34
## residual degrees of freedom: 29
## maximum log-likelihood: -438.0703
##
## AIC(5): 944.1407
## BIC(5): 1047.836
## G^2(5): 11.06545 (Likelihood ratio/deviance statistic)
## X^2(5): 9.625534 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
lc6<-poLCA(f, data=bin.votes, nclass=6)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $Environment
## Pr(1) Pr(2)
## class 1: 0.1397 0.8603
## class 2: 1.0000 0.0000
## class 3: 0.6165 0.3835
## class 4: 1.0000 0.0000
## class 5: 1.0000 0.0000
## class 6: 1.0000 0.0000
##
## $RentFreeze
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.1473 0.8527
## class 3: 0.8559 0.1441
## class 4: 0.2011 0.7989
## class 5: 0.8418 0.1582
## class 6: 0.0000 1.0000
##
## $SocialWelfare
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.0000 1.0000
## class 3: 1.0000 0.0000
## class 4: 0.0000 1.0000
## class 5: 0.8993 0.1007
## class 6: 0.3479 0.6521
##
## $GamingAndLotteries
## Pr(1) Pr(2)
## class 1: 0.1397 0.8603
## class 2: 0.9144 0.0856
## class 3: 0.6165 0.3835
## class 4: 0.2012 0.7988
## class 5: 1.0000 0.0000
## class 6: 0.7328 0.2672
##
## $HousingMinister
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.0000 1.0000
## class 3: 1.0000 0.0000
## class 4: 1.0000 0.0000
## class 5: 0.4635 0.5365
## class 6: 0.6705 0.3295
##
## $FirstTimeBuyers
## Pr(1) Pr(2)
## class 1: 1.0000 0.0000
## class 2: 0.3329 0.6671
## class 3: 1.0000 0.0000
## class 4: 0.2793 0.7207
## class 5: 0.7811 0.2189
## class 6: 0.0000 1.0000
##
## Estimated class population shares
## 0.2314 0.2447 0.116 0.1695 0.0797 0.1587
##
## Predicted class memberships (by modal posterior prob.)
## 0.2885 0.2821 0.0833 0.1667 0.0513 0.1282
##
## =========================================================
## Fit for 6 latent classes:
## =========================================================
## number of observations: 156
## number of estimated parameters: 41
## residual degrees of freedom: 22
## maximum log-likelihood: -434.961
##
## AIC(6): 951.922
## BIC(6): 1076.966
## G^2(6): 4.846814 (Likelihood ratio/deviance statistic)
## X^2(6): 3.946168 (Chi-square goodness of fit)
##
## ALERT: iterations finished, MAXIMUM LIKELIHOOD NOT FOUND
##
#store the results to compare BIC and AIC
results <- data.frame(Model=NA_character_)
results[2,1]<-c("Model 2")
results[3,1]<-c("Model 3")
results[4,1]<-c("Model 4")
results[5,1]<-c("Model 5")
results[6,1]<-c("Model 6")
results[2,2]<-lc2$bic
results[3,2]<-lc3$bic
results[4,2]<-lc4$bic
results[5,2]<-lc5$bic
results[6,2]<-lc6$bic
results[2,3]<-lc2$aic
results[3,3]<-lc3$aic
results[4,3]<-lc4$aic
results[5,3]<-lc5$aic
results[6,3]<-lc6$aic
Now, I will summarize the BIC and AIC results in a table. The two most widely used parsimony measures are the Bayesian information criterion, or BIC (Schwarz 1978), and the Akaike information criterion, or AIC (Akaike 1973). Preferred models are those that minimize the BIC and/or AIC.
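As a quick sanity check, BIC = -2*logLik + k*log(n) and AIC = -2*logLik + 2*k; plugging in the 2-class fit reported above (logLik = -462.0166, k = 13 parameters, n = 156 observations) reproduces poLCA's numbers.

```r
# Recomputing the 2-class criteria from their definitions
loglik <- -462.0166; k <- 13; n <- 156
bic <- -2 * loglik + k * log(n)   # ~989.68, matching BIC(2) above
aic <- -2 * loglik + 2 * k        # 950.0332, matching AIC(2) above
```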
#combining results to a dataframe
colnames(results)<-c("Model","BIC","AIC")
lca_results<-results[-1,] #we do not need the first row
lca_results
## Model BIC AIC
## 2 Model 2 989.6813 950.0332
## 3 Model 3 1004.9609 943.9637
## 4 Model 4 1017.0720 934.7259
## 5 Model 5 1047.8358 944.1407
## 6 Model 6 1076.9661 951.9220
In this case, the model with 2 classes has the lowest BIC and the model with 4 classes has the lowest AIC.
In general, it might be best to use AIC and BIC together in model selection. Here, in selecting the number of latent classes, BIC points to a two-class model and AIC to a four-class model, so it makes sense to select from models with 2, 3 and 4 latent classes. Note also that the 5- and 6-class fits ended with the alert that the maximum likelihood was not found, so their criteria should be treated with caution. I choose the model with two classes (Occam's razor).
# joining the result of poLCA
vote<-merge(parties,bin.votes,by="row.names",all=TRUE)
lcnew<-as.data.frame(lc2$predclass)
join<-cbind(vote,lcnew)
vote<-merge(parties,votes,by="row.names",all=TRUE)
I will start the comparison by matching the two sets of clusters. In part (a), hierarchical clustering with average linkage gave me 2 clusters, and poLCA also gave me 2 clusters. Below is the result from part (a) for hierarchical clustering. I printed the output for the cluster with 10 members and investigated its membership further.
#print the result from part (a)
a2
## cut_average2
## 1 2
## 146 10
join %>%
filter(Environment=="1") %>%
filter(RentFreeze=="1") %>%
filter(SocialWelfare=="1") %>%
filter(GamingAndLotteries=="1") %>%
filter(HousingMinister=="1") %>%
filter(FirstTimeBuyers=="1")
## Row.names Party Environment RentFreeze SocialWelfare
## 1 Barrett, Seán. FG 1 1 1
## 2 Byrne, Catherine. FG 1 1 1
## 3 Coveney, Simon. FG 1 1 1
## 4 Grealish, Noel. Ind 1 1 1
## 5 Halligan, John. Ind 1 1 1
## 6 Kenny, Enda. FG 1 1 1
## 7 McEntee, Helen. FG 1 1 1
## 8 Moran, Kevin Boxer. Ind 1 1 1
## 9 Murphy, Dara. FG 1 1 1
## 10 Ross, Shane. Ind 1 1 1
## GamingAndLotteries HousingMinister FirstTimeBuyers lc2$predclass
## 1 1 1 1 1
## 2 1 1 1 1
## 3 1 1 1 1
## 4 1 1 1 1
## 5 1 1 1 1
## 6 1 1 1 1
## 7 1 1 1 1
## 8 1 1 1 1
## 9 1 1 1 1
## 10 1 1 1 1
Further inspection of this cluster shows that all ten TDs voted YES on every topic, regardless of their party. It is the nature of hierarchical clustering to form a cluster from such identical records, all the more so when the data are binary.
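This behaviour is easy to see with a toy matrix (not the TD data): identical binary rows sit at distance 0 under method = "binary", so they fuse at height 0 in the dendrogram and always end up in the same cluster.

```r
# Toy sketch: identical binary rows have distance 0
m  <- rbind(a = c(1, 1, 1, 1),
            b = c(1, 1, 1, 1),
            c = c(0, 1, 0, 1))
dm <- as.matrix(dist(m, method = "binary"))
dm["a", "b"] == 0   # TRUE: a and b merge immediately
```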
Now compare this with the result from poLCA: cluster 1 has 56 members while cluster 2 has 100 members. poLCA detects a different pattern, quite unlike the one hierarchical clustering found.
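One direct way to compare two partitions is a cross-tabulation of their labels; each cell counts the members shared by a hierarchical cluster and a poLCA class. A minimal sketch with synthetic labels (not the real cut_average2 and lc2 output):

```r
# Toy sketch: cross-tabulating two partitions of the same six units
h_lab <- c(1, 1, 1, 1, 2, 2)   # hypothetical hierarchical labels
l_lab <- c(1, 1, 2, 2, 2, 2)   # hypothetical poLCA labels
table(hierarchical = h_lab, poLCA = l_lab)
```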
library(tidyverse)
library(hrbrthemes)
## NOTE: Either Arial Narrow or Roboto Condensed fonts are required to use these themes.
## Please use hrbrthemes::import_roboto_condensed() to install Roboto Condensed and
## if Arial Narrow is not on your system, please see https://bit.ly/arialnarrow
library(viridis)
## Loading required package: viridisLite
library(plotly)
##
## Attaching package: 'plotly'
## The following object is masked from 'package:MASS':
##
## select
## The following object is masked from 'package:ggplot2':
##
## last_plot
## The following object is masked from 'package:stats':
##
## filter
## The following object is masked from 'package:graphics':
##
## layout
dataframe_lc2<-as.data.frame(lc2$probs)
# Load data
data2 <- read.table("lca2.csv", header=T, sep=",")
#I replaced this output with Tableau plots for prettier and easier analysis
#install.packages("dendextend")
library(heatmaply)
## Registered S3 method overwritten by 'seriation':
## method from
## reorder.hclust gclus
##
## ======================
## Welcome to heatmaply version 1.1.0
##
## Type citation('heatmaply') for how to cite the package.
## Type ?heatmaply for the main documentation.
##
## The github page is: https://github.com/talgalili/heatmaply/
## Please submit your suggestions and bug-reports at: https://github.com/talgalili/heatmaply/issues
## Or contact: <tal.galili@gmail.com>
## ======================
p2 <- heatmaply(data2[,-1],
dendrogram = "none",
xlab = "", ylab = "",
main = "",
margins = c(60,100,40,20),
grid_color = "white",
grid_width = 0.00001
)
p2
The heatmap shows that the clusters are formed around clear dissimilarities. Members of cluster 2 have a probability of voting YES close or equal to 1.00 on the RentFreeze, SocialWelfare, HousingMinister and FirstTimeBuyers topics, while members of cluster 1 have a probability of voting YES close or equal to 1.00 on the Environment topic.
lc2<-poLCA(f, data=bin.votes, nclass=2,graphs = TRUE)
## Conditional item response (column) probabilities,
## by outcome variable, for each class (row)
##
## $Environment
## Pr(1) Pr(2)
## class 1: 1.000 0.000
## class 2: 0.345 0.655
##
## $RentFreeze
## Pr(1) Pr(2)
## class 1: 0.1684 0.8316
## class 2: 0.9739 0.0261
##
## $SocialWelfare
## Pr(1) Pr(2)
## class 1: 0.1631 0.8369
## class 2: 1.0000 0.0000
##
## $GamingAndLotteries
## Pr(1) Pr(2)
## class 1: 0.6765 0.3235
## class 2: 0.3571 0.6429
##
## $HousingMinister
## Pr(1) Pr(2)
## class 1: 0.4786 0.5214
## class 2: 0.9670 0.0330
##
## $FirstTimeBuyers
## Pr(1) Pr(2)
## class 1: 0.2652 0.7348
## class 2: 1.0000 0.0000
##
## Estimated class population shares
## 0.6281 0.3719
##
## Predicted class memberships (by modal posterior prob.)
## 0.641 0.359
##
## =========================================================
## Fit for 2 latent classes:
## =========================================================
## number of observations: 156
## number of estimated parameters: 13
## residual degrees of freedom: 50
## maximum log-likelihood: -462.0166
##
## AIC(2): 950.0332
## BIC(2): 989.6813
## G^2(2): 58.95794 (Likelihood ratio/deviance statistic)
## X^2(2): 67.89258 (Chi-square goodness of fit)
##